Introduction to R Programming
Part II: Data Manipulation
Welcome
Welcome to the data manipulation part of the Intro to R Programming Workshop!
Data frames
Data frames are the main R object that we will be interacting with. In many ways you already know about them too.
An example of a data frame is the table from the Animal Ageing and Longevity Database that we saw earlier.
| Animal | Maximum Longevity (in years) |
|---|---|
| Human | 122.5 |
| Domestic dog | 24.0 |
| Domestic cat | 30.0 |
| American alligator | 77.0 |
| Golden hamster | 3.9 |
| King penguin | 26.0 |
| Lion | 27.0 |
| Greenland shark | 392.0 |
| Galapagos tortoise | 177.0 |
| African bush elephant | 65.0 |
| California sea lion | 35.7 |
| Fruit fly | 0.3 |
| House mouse | 4.0 |
| Giraffe | 39.5 |
| Wild boar | 27.0 |
human_lifespan <- 122.5
dog_lifespan <- 24
lion_lifespan <- 27
mouse_lifespan <- 4
fly_lifespan <- 0.3
boar_lifespan <- 27
alligator_lifespan <- 77
greenland_shark_lifespan <- 392
galapagos_tortoise_lifespan <- 177
animal_lifespans <- c(greenland_shark_lifespan, dog_lifespan,
galapagos_tortoise_lifespan,
mouse_lifespan, fly_lifespan,
lion_lifespan, boar_lifespan,
alligator_lifespan, human_lifespan)
animals <- c("greenland_shark", "dog",
             "galapagos_tortoise", "mouse",
             "fly", "lion", "boar",
             "alligator", "human")
To create a data frame from scratch we can simply pass two
(same-sized) vectors to the function data.frame.
data.frame(animals, animal_lifespans)
## animals animal_lifespans
## 1 greenland_shark 392.0
## 2 dog 24.0
## 3 galapagos_tortoise 177.0
## 4 mouse 4.0
## 5 fly 0.3
## 6 lion 27.0
## 7 boar 27.0
## 8 alligator 77.0
## 9 human 122.5
We can also assign a data frame to a name so that we can reuse it later.
animals_data <- data.frame(animals, animal_lifespans)
animals_data
## animals animal_lifespans
## 1 greenland_shark 392.0
## 2 dog 24.0
## 3 galapagos_tortoise 177.0
## 4 mouse 4.0
## 5 fly 0.3
## 6 lion 27.0
## 7 boar 27.0
## 8 alligator 77.0
## 9 human 122.5
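As a small aside, data.frame() also accepts name = vector pairs, which lets us pick the column names directly. The sketch below uses a shortened three-animal example; pets, name and lifespan are illustrative names and not part of the workshop data. str() then gives a compact overview of the result.

```r
# Illustrative data: three animals taken from the table above
pets <- data.frame(
  name = c("dog", "mouse", "fly"),
  lifespan = c(24, 4, 0.3)
)

# str() shows the dimensions, the column names and the column types
str(pets)
```

Naming the columns at creation time saves a later renaming step.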
Data Dimensions
We can use functions to determine the shape of our data.
How many columns does the data have?
We can simply use the function ncol() to determine the
number of columns.
ncol(animals_data)
## [1] 2
How many rows does the data have?
Run nrow() to determine the number of rows.
nrow(animals_data)
## [1] 9
dim()
We can also use dim() to get the same information in one
call:
dim(animals_data)
## [1] 9 2
The first value is the number of rows, the second value is the number of columns.
Variable Names
We can also retrieve the variable names of any data frame by passing
it to names().
names(animals_data)
## [1] "animals" "animal_lifespans"
Retrieve variables
If we want to retrieve specific variables from a data frame we can do
that via the $ operator.
\[\color{red}{\text{dataset}}$\color{orange}{\text{variable_name}}\]
Think of the $ symbol as a door opener that helps you
check what is inside an object.
animals_data$animal_lifespans
## [1] 392.0 24.0 177.0 4.0 0.3 27.0 27.0 77.0 122.5
animals_data$animals
## [1] "greenland_shark" "dog" "galapagos_tortoise"
## [4] "mouse" "fly" "lion"
## [7] "boar" "alligator" "human"
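Base R offers a second way to pull out a single variable: double square brackets with the variable name as a string. This can be handy when the name is stored in another object. A minimal sketch with a small stand-in data frame (toy_data and wanted are hypothetical names used only for illustration):

```r
toy_data <- data.frame(
  animals = c("dog", "mouse"),
  animal_lifespans = c(24, 4)
)

# Both lines return the same numeric vector
toy_data$animal_lifespans
toy_data[["animal_lifespans"]]

# [[ ]] also works when the variable name is stored in an object
wanted <- "animals"
toy_data[[wanted]]
```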
(Re-)Code variables
We can also use the $ operator to add new
variables.
In the example below we create a variable called
animal_to_human, which holds each animal's maximum lifespan as a
fraction of the human maximum lifespan.
We do that by simply assigning a vector containing that information
to animals_data$animal_to_human even if that variable
doesn’t exist yet.
animals_data$animal_to_human <- animals_data$animal_lifespans / human_lifespan
animals_data
## animals animal_lifespans animal_to_human
## 1 greenland_shark 392.0 3.20000000
## 2 dog 24.0 0.19591837
## 3 galapagos_tortoise 177.0 1.44489796
## 4 mouse 4.0 0.03265306
## 5 fly 0.3 0.00244898
## 6 lion 27.0 0.22040816
## 7 boar 27.0 0.22040816
## 8 alligator 77.0 0.62857143
## 9 human 122.5 1.00000000
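The heading also mentions recoding. Assigning to a variable that already exists overwrites it in exactly the same way. The sketch below rebuilds a two-row version of the data (toy_data is a hypothetical name) and replaces the ratio with a version rounded to two decimals.

```r
human_lifespan <- 122.5

toy_data <- data.frame(
  animals = c("dog", "greenland_shark"),
  animal_lifespans = c(24, 392)
)

# Create the new variable ...
toy_data$animal_to_human <- toy_data$animal_lifespans / human_lifespan

# ... and overwrite (recode) it with a rounded version
toy_data$animal_to_human <- round(toy_data$animal_to_human, 2)

toy_data
```

Because the old values are gone after overwriting, it is often safer to store a recoded variable under a new name first.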
Indexing
Just as we did before with vectors we can also index data frames with
square brackets: []. However, unlike vectors, data frames
have two dimensions.
So that is why the square brackets in this case take two inputs, separated by a comma:
\[\color{red}{\text{dataset}}[\color{orange}{\text{rows}},\color{lightblue}{\text{columns}}]\]
The first value after the opening square bracket refers to \(\color{orange}{\text{which rows}}\) you want to keep.
The second value refers to \(\color{lightblue}{\text{which columns}}\) you want to keep.
So if we only want the value in the first row of the first column of
animals_data, this is how we get it:
animals_data[1, 1]
## [1] "greenland_shark"
If we want to keep a certain row but all columns we can do this by leaving the second value within the square brackets empty.
animals_data[1, ]
## animals animal_lifespans animal_to_human
## 1 greenland_shark 392 3.2
The same works for columns: leave the first value empty to keep all rows of a certain column.
Note that this actually returns a vector:
animals_data[, 1]
## [1] "greenland_shark" "dog" "galapagos_tortoise"
## [4] "mouse" "fly" "lion"
## [7] "boar" "alligator" "human"
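If we would rather keep a data frame when picking a single column, base R's drop argument can be set to FALSE. A minimal sketch on a small stand-in data frame (toy_data, as_vector and as_frame are hypothetical names):

```r
toy_data <- data.frame(
  animals = c("dog", "mouse", "fly"),
  animal_lifespans = c(24, 4, 0.3)
)

# Default: a single column is simplified to a vector
as_vector <- toy_data[, 1]
is.data.frame(as_vector)

# drop = FALSE keeps the result as a one-column data frame
as_frame <- toy_data[, 1, drop = FALSE]
is.data.frame(as_frame)
```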
Indexing with logical tests
We can also do more complex indexing by keeping only the rows that fulfill a certain condition. Let’s say we only want to keep the rows that contain animals that have longer lifespans than humans.
The logical test on its own returns one TRUE or FALSE per row:
animals_data$animal_lifespans > human_lifespan
## [1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Placing that test in the row position keeps only the rows where it is TRUE:
animals_data[animals_data$animal_lifespans > human_lifespan, ]
## animals animal_lifespans animal_to_human
## 1 greenland_shark 392 3.200000
## 3 galapagos_tortoise 177 1.444898
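Logical tests can be combined with & (and) or | (or), and the column position can be filled in at the same time. The following sketch, on a hypothetical four-row stand-in called toy_data, keeps the names of animals that live longer than 100 but less than 200 years.

```r
toy_data <- data.frame(
  animals = c("greenland_shark", "dog", "galapagos_tortoise", "human"),
  animal_lifespans = c(392, 24, 177, 122.5)
)

# Rows: lifespan above 100 AND below 200 years; columns: only "animals"
long_lived <- toy_data[
  toy_data$animal_lifespans > 100 & toy_data$animal_lifespans < 200,
  "animals"
]
long_lived
```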
R Packages
Packages are at the heart of R:
R packages are basically collections of functions that you load into your working environment.
They contain code that other R users have prepared for the community.
It’s good to know your packages; they can really make your life easier.
I suggest keeping track of package developments on Twitter via #rstats.
You can install packages in R using the
install.packages function:
install.packages("janitor")
However, installing is not enough. You also need to load the package
via library.
library(janitor)
## Warning: package 'janitor' was built under R version 4.1.3
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
Think of install.packages as buying a set of tools (for
free!) and library as pulling out the tools each time you
want to work with them.
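Since the tools only need to be bought once, a common pattern is to install a package only when it is missing. The sketch below wraps base R's requireNamespace() in a small helper; is_installed is a hypothetical name chosen for this example, and the install line is commented out so that nothing gets installed by accident.

```r
# TRUE if the package is available on this machine, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this check succeeds
is_installed("stats")

# Typical use:
# if (!is_installed("janitor")) install.packages("janitor")
```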
The Tidyverse
What is the tidyverse?
The tidyverse describes itself:
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Core principle: tidy data
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
We have already seen tidy data:
| Animal | Maximum Lifespan | Animal/Human Years Ratio |
|---|---|---|
| Domestic dog | 24.0 | 5.10 |
| Domestic cat | 30.0 | 4.08 |
| American alligator | 77.0 | 1.59 |
| Golden hamster | 3.9 | 31.41 |
| King penguin | 26.0 | 4.71 |
Untidy data I
| Animal | Type | Value |
|---|---|---|
| Domestic dog | lifespan | 24.0 |
| Domestic dog | ratio | 5.10 |
| Domestic cat | lifespan | 30.0 |
| Domestic cat | ratio | 4.08 |
| American alligator | lifespan | 77.0 |
| American alligator | ratio | 1.59 |
| Golden hamster | lifespan | 3.9 |
| Golden hamster | ratio | 31.41 |
| King penguin | lifespan | 26.0 |
| King penguin | ratio | 4.71 |
The data above has multiple rows with the same observation (animal).
= not tidy
Untidy data II
| Animal | Lifespan/Ratio |
|---|---|
| Domestic dog | 24.0 / 5.10 |
| Domestic cat | 30.0 / 4.08 |
| American alligator | 77.0 / 1.59 |
| Golden hamster | 3.9 / 31.41 |
| King penguin | 26.0 / 4.71 |
The data above has multiple variables per column.
= not tidy
Core principle: tidy data
Artist: Allison Horst
Tidy data has two decisive advantages:
- Consistently prepared data is easier to read, process, load and save.
- Many procedures (or the associated functions) in R require this type of data.
Artist: Allison Horst
Installing and loading the tidyverse
First we install the packages of the tidyverse like this. In Google
Colab we actually don’t need to install the tidyverse
because it comes pre-installed!
install.packages("tidyverse")
Then we load them:
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.7 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
A new data set appears..
We are going to work with a new data set from here on out.
No worries, we will stay within the animal kingdom but we need a data set that is a little more complex than what we have seen already.
Meet the Palmer Station penguins!
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER.
Artist: Allison Horst
Palmer Penguins
We could install the R package palmerpenguins and then
access the data.
However, we are going to use a different method: directly load a .csv file (comma-separated values) into R from the internet.
We can use the readr package which provides many
convenient functions to load data into R. Here we need
read_csv:
penguins_raw <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins_raw.csv")
## Warning in gzfile(file, mode): cannot open compressed file 'C:/Users/favoo/
## AppData/Local/Temp/Rtmpm6bE9l\file7288d5d4f67', probable reason 'No such file or
## directory'
##
## -- Column specification --------------------------------------------------------
## cols(
## studyName = col_character(),
## `Sample Number` = col_double(),
## Species = col_character(),
## Region = col_character(),
## Island = col_character(),
## Stage = col_character(),
## `Individual ID` = col_character(),
## `Clutch Completion` = col_character(),
## `Date Egg` = col_date(format = ""),
## `Culmen Length (mm)` = col_double(),
## `Culmen Depth (mm)` = col_double(),
## `Flipper Length (mm)` = col_double(),
## `Body Mass (g)` = col_double(),
## Sex = col_character(),
## `Delta 15 N (o/oo)` = col_double(),
## `Delta 13 C (o/oo)` = col_double(),
## Comments = col_character()
## )
penguins_raw
## # A tibble: 344 x 17
## studyName `Sample Number` Species Region Island Stage `Individual ID`
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin~ Anvers Torge~ Adul~ N1A1
## 2 PAL0708 2 Adelie Penguin~ Anvers Torge~ Adul~ N1A2
## 3 PAL0708 3 Adelie Penguin~ Anvers Torge~ Adul~ N2A1
## 4 PAL0708 4 Adelie Penguin~ Anvers Torge~ Adul~ N2A2
## 5 PAL0708 5 Adelie Penguin~ Anvers Torge~ Adul~ N3A1
## 6 PAL0708 6 Adelie Penguin~ Anvers Torge~ Adul~ N3A2
## 7 PAL0708 7 Adelie Penguin~ Anvers Torge~ Adul~ N4A1
## 8 PAL0708 8 Adelie Penguin~ Anvers Torge~ Adul~ N4A2
## 9 PAL0708 9 Adelie Penguin~ Anvers Torge~ Adul~ N5A1
## 10 PAL0708 10 Adelie Penguin~ Anvers Torge~ Adul~ N5A2
## # ... with 334 more rows, and 10 more variables: `Clutch Completion` <chr>,
## # `Date Egg` <date>, `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
## # `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
## # `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>
take a glimpse
We can also take a look at the data set using the glimpse
function from dplyr.
glimpse(penguins_raw)
## Rows: 344
## Columns: 17
## $ studyName <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL~
## $ `Sample Number` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1~
## $ Species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P~
## $ Region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"~
## $ Island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse~
## $ Stage <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu~
## $ `Individual ID` <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", ~
## $ `Clutch Completion` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", ~
## $ `Date Egg` <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16,~
## $ `Culmen Length (mm)` <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34~
## $ `Culmen Depth (mm)` <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18~
## $ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,~
## $ `Body Mass (g)` <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34~
## $ Sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"~
## $ `Delta 15 N (o/oo)` <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18~
## $ `Delta 13 C (o/oo)` <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.298~
## $ Comments <chr> "Not enough blood for isotopes.", NA, NA, "Adult~
initial data cleaning using janitor
janitor is not officially part of the tidyverse package
collection, but in my view it is incredibly important to know.
It provides some convenient functions for basic cleaning of the data.
Just like any tidyverse-style package, its functions fulfill the following criterion:
The data is always the first argument.
This lets us pass the data by position instead of naming the argument.
install.packages("janitor")
library(janitor)
clean_names()
One annoyance with the penguins_raw data is that it has
spaces in the variable names. Urgh!
Variable names that contain spaces have to be wrapped in backticks:
penguins_raw$`Delta 15 N (o/oo)`
## [1] NA 8.94956 8.36821 NA 8.76651 8.66496 9.18718 9.46060
## [9] NA 9.13362 8.63243 NA NA NA 8.55583 NA
## [17] 9.18528 8.67538 8.47827 9.11616 8.73762 8.66271 9.22286 8.43423
## [25] 9.63954 9.21292 8.93997 8.08138 8.38404 8.90027 9.69756 9.72764
## [33] 9.66523 8.79665 9.17847 9.15308 9.18985 8.04787 9.41131 NA
## [41] 9.68933 NA 9.50772 9.23720 9.36392 9.49106 NA NA
## [49] 9.51784 8.87988 8.46616 8.51362 8.19539 8.48095 8.41837 8.35396
## [57] 8.57199 8.56674 9.07878 9.10800 8.96472 8.74802 8.58063 8.62264
## [65] 8.62623 8.85562 8.56192 8.71078 8.47781 8.86853 7.88863 9.29808
## [73] 8.33524 8.18658 8.70642 8.29930 8.47257 8.35540 7.82381 9.05736
## [81] 7.69778 8.63259 7.88494 8.90002 8.32718 9.14863 8.57087 8.59147
## [89] 9.07826 8.36936 8.46531 8.77018 8.01485 8.49915 8.90723 8.48204
## [97] 8.10277 8.39459 9.04218 8.97025 8.84451 9.01079 9.21510 9.51929
## [105] 9.02642 8.85699 8.77322 9.59245 9.79532 9.31735 8.43951 8.65466
## [113] 9.02657 8.80186 8.80967 8.91434 9.18021 9.49645 8.96436 9.32277
## [121] 9.04296 9.11066 9.30722 9.59462 8.81668 9.22537 8.88098 8.52566
## [129] 9.19031 9.10702 8.98460 8.86495 8.98705 8.56708 8.71700 8.94365
## [137] 8.75984 8.95998 8.61651 9.25769 9.28810 9.23408 8.79787 9.05674
## [145] 9.06829 9.22033 9.11006 8.68744 8.94332 8.97533 8.93465 8.89640
## [153] 7.99300 8.14756 8.14705 8.25540 8.23450 7.99530 8.24515 8.22673
## [161] 8.13643 8.16310 8.19579 8.10417 7.77672 7.82080 7.79958 8.07137
## [169] 7.63884 8.27376 7.84057 7.96491 7.89620 7.63220 7.90436 7.90971
## [177] 7.68528 7.83733 7.96621 7.92358 7.68870 8.30515 NA 7.63452
## [185] 7.97408 7.76843 7.89744 8.03659 7.96935 8.13746 8.01979 8.14776
## [193] 8.14567 8.38324 8.37615 8.26548 8.46894 8.27141 8.47829 8.65803
## [201] 8.45167 8.55868 8.38289 8.39867 8.51951 8.50153 8.48789 8.63488
## [209] 8.58319 8.63604 8.48367 8.74647 8.65015 8.60092 8.62870 8.49662
## [217] 8.60447 8.47067 8.24253 8.49854 8.64931 8.63551 8.53018 8.35078
## [225] 8.24651 8.58487 8.47938 8.59640 8.39299 8.40327 8.24694 8.19749
## [233] 8.35802 8.28601 8.19101 8.20042 8.11238 8.27428 8.23468 8.15426
## [241] 8.12691 8.27595 8.29671 8.36701 8.15566 8.83352 8.20106 8.27102
## [249] 8.03624 7.88810 8.16582 8.20660 8.10231 8.31180 8.30817 8.65914
## [257] 8.25818 8.32359 8.12311 8.41017 8.42070 8.45738 8.24691 8.29226
## [265] 8.21634 8.78557 8.30231 8.08354 8.04111 8.33825 7.99184 NA
## [273] 8.41151 8.30166 8.24246 8.36390 9.03935 8.92069 9.29078 8.64701
## [281] 9.00642 8.88942 8.85664 8.63701 8.47173 8.79581 8.95063 8.68747
## [289] 8.72037 9.02330 9.12277 9.80590 10.02019 9.14382 9.32105 9.27158
## [297] 9.35138 9.42666 9.35416 9.28153 9.74144 9.36799 8.93990 9.63074
## [305] 9.37369 9.25177 9.08458 9.49283 9.36668 9.23196 9.75486 9.07825
## [313] 8.83502 9.43146 9.80589 10.02544 9.53262 9.61734 10.02372 9.36493
## [321] 9.43684 9.45827 9.46819 9.34089 9.68950 9.32169 9.46929 9.43782
## [329] 9.41500 9.93727 9.56534 9.77528 9.62357 9.88809 9.74492 9.46985
## [337] NA 9.65061 9.26715 9.70465 9.37608 9.46180 9.98044 9.39305
penguins_raw$`Flipper Length (mm)`
## [1] 181 186 195 NA 193 190 181 195 193 190 186 180 182 191 198 185 195 197
## [19] 184 194 174 180 189 185 180 187 183 187 172 180 178 178 188 184 195 196
## [37] 190 180 181 184 182 195 186 196 185 190 182 179 190 191 186 188 190 200
## [55] 187 191 186 193 181 194 185 195 185 192 184 192 195 188 190 198 190 190
## [73] 196 197 190 195 191 184 187 195 189 196 187 193 191 194 190 189 189 190
## [91] 202 205 185 186 187 208 190 196 178 192 192 203 183 190 193 184 199 190
## [109] 181 197 198 191 193 197 191 196 188 199 189 189 187 198 176 202 186 199
## [127] 191 195 191 210 190 197 193 199 187 190 191 200 185 193 193 187 188 190
## [145] 192 185 190 184 195 193 187 201 211 230 210 218 215 210 211 219 209 215
## [163] 214 216 214 213 210 217 210 221 209 222 218 215 213 215 215 215 216 215
## [181] 210 220 222 209 207 230 220 220 213 219 208 208 208 225 210 216 222 217
## [199] 210 225 213 215 210 220 210 225 217 220 208 220 208 224 208 221 214 231
## [217] 219 230 214 229 220 223 216 221 221 217 216 230 209 220 215 223 212 221
## [235] 212 224 212 228 218 218 212 230 218 228 212 224 214 226 216 222 203 225
## [253] 219 228 215 228 216 215 210 219 208 209 216 229 213 230 217 230 217 222
## [271] 214 NA 215 222 212 213 192 196 193 188 197 198 178 197 195 198 193 194
## [289] 185 201 190 201 197 181 190 195 181 191 187 193 195 197 200 200 191 205
## [307] 187 201 187 203 195 199 195 210 192 205 210 187 196 196 196 201 190 212
## [325] 187 198 199 201 193 203 187 197 191 203 202 194 206 189 195 207 202 193
## [343] 210 198
janitor can help with that,
using a function called clean_names().
clean_names() turns all our messy column
names into readable lower-case snake case:
penguins_clean <- clean_names(penguins_raw)
This is what the variables look like now:
glimpse(penguins_clean)
## Rows: 344
## Columns: 17
## $ study_name <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708~
## $ sample_number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1~
## $ species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu~
## $ region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A~
## $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", ~
## $ stage <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, ~
## $ individual_id <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A~
## $ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"~
## $ date_egg <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200~
## $ culmen_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ culmen_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F~
## $ delta_15_n_o_oo <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,~
## $ delta_13_c_o_oo <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, ~
## $ comments <chr> "Not enough blood for isotopes.", NA, NA, "Adult not~
remove_constant()
Now we have another problem. Not all variables in the
penguins_clean data set are that useful.
Some of them are the same across all observations. We don’t need
those variables, like region.
table(penguins_clean$region)
##
## Anvers
## 344
We can use the base R function table to quickly get some
tabulations of our variable.
The function remove_constant() helps us get rid of these constant
columns.
penguins_clean <- remove_constant(penguins_clean, quiet = FALSE)
## Removing 2 constant columns of 17 columns total (Removed: region, stage).
When we set quiet = FALSE we even get some info about what
exactly was removed. Neat!
Another useful function in janitor is
remove_empty(), which removes all rows or columns that consist
only of missing values (i.e. NA).
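To see remove_empty() in action, here is a minimal sketch on a made-up data frame; toy_data and cleaned are hypothetical names, and the example assumes janitor is installed. Column b and the second row contain nothing but NA.

```r
library(janitor)

# Hypothetical data: column `b` and the second row hold only NA
toy_data <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, NA),
  c = c("x", NA, "z")
)

# Drop rows and columns that consist only of missing values
cleaned <- remove_empty(toy_data, which = c("rows", "cols"))
cleaned
```

The which argument can also be set to just "rows" or just "cols" if only one of the two should be cleaned.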
Data cleaning using tidyr
Now we are already fairly advanced in our tidying.
But our dataset is still not entirely tidy yet.
Consider the species variable:
table(penguins_clean$species)
##
## Adelie Penguin (Pygoscelis adeliae)
## 152
## Chinstrap penguin (Pygoscelis antarctica)
## 68
## Gentoo penguin (Pygoscelis papua)
## 124
This variable violates the tidy rule that each cell should contain a single value.
species holds both the common name and the Latin name of the penguin.
separate()
We can use a tidyr function called
separate() to turn this into two variables.
Two arguments are important for that:
- sep: specifies the character(s) at which the value should be split
- into: a vector which specifies the names of the resulting new variables
In our case we want to split at a space followed by an opening parenthesis. Because ( has a special meaning in regular expressions, we write it as
\\(. We name our new variables species and
latin_name:
penguins_clean <- separate(penguins_clean, species, sep = " \\(", into = c("species", "latin_name"))
penguins_clean
## # A tibble: 344 x 16
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin Pygoscelis adel~ Torge~ N1A1
## 2 PAL0708 2 Adelie Penguin Pygoscelis adel~ Torge~ N1A2
## 3 PAL0708 3 Adelie Penguin Pygoscelis adel~ Torge~ N2A1
## 4 PAL0708 4 Adelie Penguin Pygoscelis adel~ Torge~ N2A2
## 5 PAL0708 5 Adelie Penguin Pygoscelis adel~ Torge~ N3A1
## 6 PAL0708 6 Adelie Penguin Pygoscelis adel~ Torge~ N3A2
## 7 PAL0708 7 Adelie Penguin Pygoscelis adel~ Torge~ N4A1
## 8 PAL0708 8 Adelie Penguin Pygoscelis adel~ Torge~ N4A2
## 9 PAL0708 9 Adelie Penguin Pygoscelis adel~ Torge~ N5A1
## 10 PAL0708 10 Adelie Penguin Pygoscelis adel~ Torge~ N5A2
## # ... with 334 more rows, and 10 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
Now there is still a trailing ) at the end of
latin_name. We can remove that using the
stringr package and more specifically the
str_remove() function.
penguins_clean$latin_name <- str_remove(penguins_clean$latin_name, "\\)")
penguins_clean
## # A tibble: 344 x 16
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin Pygoscelis adel~ Torge~ N1A1
## 2 PAL0708 2 Adelie Penguin Pygoscelis adel~ Torge~ N1A2
## 3 PAL0708 3 Adelie Penguin Pygoscelis adel~ Torge~ N2A1
## 4 PAL0708 4 Adelie Penguin Pygoscelis adel~ Torge~ N2A2
## 5 PAL0708 5 Adelie Penguin Pygoscelis adel~ Torge~ N3A1
## 6 PAL0708 6 Adelie Penguin Pygoscelis adel~ Torge~ N3A2
## 7 PAL0708 7 Adelie Penguin Pygoscelis adel~ Torge~ N4A1
## 8 PAL0708 8 Adelie Penguin Pygoscelis adel~ Torge~ N4A2
## 9 PAL0708 9 Adelie Penguin Pygoscelis adel~ Torge~ N5A1
## 10 PAL0708 10 Adelie Penguin Pygoscelis adel~ Torge~ N5A2
## # ... with 334 more rows, and 10 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
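The double backslash is needed because parentheses have a special meaning in regular expressions. The same idea can be tried out in base R on a plain character vector (latin is an illustrative name); setting fixed = TRUE is an alternative that treats the pattern as literal text.

```r
latin <- c("Pygoscelis adeliae)", "Pygoscelis papua)")

# Escape the parenthesis so that it is matched literally
escaped <- sub("\\)", "", latin)
escaped

# Or switch off regular expressions altogether
literal <- sub(")", "", latin, fixed = TRUE)
literal
```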
There is also a function called unite() which works in
the opposite direction.
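A minimal sketch of unite() on a made-up two-row data frame (toy_penguins, united and full_name are hypothetical names; the example assumes tidyr is installed). The two columns are pasted together with a separator of our choosing.

```r
library(tidyr)

toy_penguins <- data.frame(
  species = c("Adelie Penguin", "Gentoo penguin"),
  latin_name = c("Pygoscelis adeliae", "Pygoscelis papua")
)

# Paste the two columns into one new column, separated by " / "
united <- unite(toy_penguins, "full_name", species, latin_name, sep = " / ")
united
```

By default the input columns are removed from the result, so united has a single column.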
Now our data is in tidy format!
We were lucky because the data already came in a format with one observation per row.
But what if that is not the case?
pivot_wider() and pivot_longer()
tidyr also comes equipped to deal with data that has
more than one row per observation.
The function to use here is called pivot_wider.
Our penguins_clean data is already tidy,
but we can read in a dataset that isn’t:
untidy_animals <- read_csv("https://github.com/favstats/ds3_r_intro/blob/main/data/untidy_animals.csv?raw=true")
##
## -- Column specification --------------------------------------------------------
## cols(
## Animal = col_character(),
## Type = col_character(),
## Value = col_double()
## )
untidy_animals
## # A tibble: 10 x 3
## Animal Type Value
## <chr> <chr> <dbl>
## 1 Domestic dog lifespan 24
## 2 Domestic dog ratio 5.1
## 3 Domestic cat lifespan 30
## 4 Domestic cat ratio 4.08
## 5 American alligator lifespan 77
## 6 American alligator ratio 1.59
## 7 Golden hamster lifespan 3.9
## 8 Golden hamster ratio 31.4
## 9 King penguin lifespan 26
## 10 King penguin ratio 4.71
You may recognize this data from the subsection Untidy data I.
Now let’s use pivot_wider to make every row an
observation.
We need two main arguments for that:
- names_from: tells the function where the new column names come from
- values_from: tells the function where the values should come from
tidy_animals <- pivot_wider(untidy_animals, names_from = Type, values_from = Value)
tidy_animals
## # A tibble: 5 x 3
## Animal lifespan ratio
## <chr> <dbl> <dbl>
## 1 Domestic dog 24 5.1
## 2 Domestic cat 30 4.08
## 3 American alligator 77 1.59
## 4 Golden hamster 3.9 31.4
## 5 King penguin 26 4.71
pivot_longer can untidy our data again
The argument cols = tells the function which variables
to turn into long format:
pivot_longer(tidy_animals, cols = c(lifespan, ratio))
## # A tibble: 10 x 3
## Animal name value
## <chr> <chr> <dbl>
## 1 Domestic dog lifespan 24
## 2 Domestic dog ratio 5.1
## 3 Domestic cat lifespan 30
## 4 Domestic cat ratio 4.08
## 5 American alligator lifespan 77
## 6 American alligator ratio 1.59
## 7 Golden hamster lifespan 3.9
## 8 Golden hamster ratio 31.4
## 9 King penguin lifespan 26
## 10 King penguin ratio 4.71
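The default column names name and value can be changed with the names_to and values_to arguments of pivot_longer(). The sketch below uses a hypothetical two-row data frame (toy_wide and toy_long are illustrative names) and assumes tidyr is installed; it brings back the column names Type and Value from the untidy version.

```r
library(tidyr)

toy_wide <- data.frame(
  Animal = c("Domestic dog", "Domestic cat"),
  lifespan = c(24, 30),
  ratio = c(5.10, 4.08)
)

toy_long <- pivot_longer(
  toy_wide,
  cols = c(lifespan, ratio),
  names_to = "Type",    # column that receives the old column names
  values_to = "Value"   # column that receives the cell values
)
toy_long
```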
Data manipulation using dplyr
Artist: Allison Horst
select()
helps you select variables
select() is part of the dplyr package and keeps or drops
variables (columns) by name.
Remember: with tidyverse-style functions, data is always the first argument.
Select variables
Here we only keep individual_id, sex and
species.
select(penguins_clean, individual_id, sex, species)
## # A tibble: 344 x 3
## individual_id sex species
## <chr> <chr> <chr>
## 1 N1A1 MALE Adelie Penguin
## 2 N1A2 FEMALE Adelie Penguin
## 3 N2A1 FEMALE Adelie Penguin
## 4 N2A2 <NA> Adelie Penguin
## 5 N3A1 FEMALE Adelie Penguin
## 6 N3A2 MALE Adelie Penguin
## 7 N4A1 FEMALE Adelie Penguin
## 8 N4A2 MALE Adelie Penguin
## 9 N5A1 <NA> Adelie Penguin
## 10 N5A2 <NA> Adelie Penguin
## # ... with 334 more rows
But select() is more powerful than that.
Remove variables
We can also remove variables with a
- (minus).
Here we remove individual_id, sex and
species.
select(penguins_clean, -individual_id, -sex, -species)
## # A tibble: 344 x 13
## study_name sample_number latin_name island clutch_completi~ date_egg
## <chr> <dbl> <chr> <chr> <chr> <date>
## 1 PAL0708 1 Pygoscelis adeli~ Torge~ Yes 2007-11-11
## 2 PAL0708 2 Pygoscelis adeli~ Torge~ Yes 2007-11-11
## 3 PAL0708 3 Pygoscelis adeli~ Torge~ Yes 2007-11-16
## 4 PAL0708 4 Pygoscelis adeli~ Torge~ Yes 2007-11-16
## 5 PAL0708 5 Pygoscelis adeli~ Torge~ Yes 2007-11-16
## 6 PAL0708 6 Pygoscelis adeli~ Torge~ Yes 2007-11-16
## 7 PAL0708 7 Pygoscelis adeli~ Torge~ No 2007-11-15
## 8 PAL0708 8 Pygoscelis adeli~ Torge~ No 2007-11-15
## 9 PAL0708 9 Pygoscelis adeli~ Torge~ Yes 2007-11-09
## 10 PAL0708 10 Pygoscelis adeli~ Torge~ Yes 2007-11-09
## # ... with 334 more rows, and 7 more variables: culmen_length_mm <dbl>,
## # culmen_depth_mm <dbl>, flipper_length_mm <dbl>, body_mass_g <dbl>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
Selection helpers
These selection helpers match variables according to a given pattern.
starts_with(): Starts with a prefix.
ends_with(): Ends with a suffix.
contains(): Contains a literal string.
matches(): Matches a regular expression.
For example: let’s keep all variables that start with
s:
select(penguins_clean, starts_with("s"))
## # A tibble: 344 x 4
## study_name sample_number species sex
## <chr> <dbl> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin MALE
## 2 PAL0708 2 Adelie Penguin FEMALE
## 3 PAL0708 3 Adelie Penguin FEMALE
## 4 PAL0708 4 Adelie Penguin <NA>
## 5 PAL0708 5 Adelie Penguin FEMALE
## 6 PAL0708 6 Adelie Penguin MALE
## 7 PAL0708 7 Adelie Penguin FEMALE
## 8 PAL0708 8 Adelie Penguin MALE
## 9 PAL0708 9 Adelie Penguin <NA>
## 10 PAL0708 10 Adelie Penguin <NA>
## # ... with 334 more rows
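The other helpers work the same way. The following sketch tries three of them on a made-up two-row data frame whose column names mimic the cleaned penguin data (toy_penguins and the result names are hypothetical; the example assumes dplyr is installed).

```r
library(dplyr)

toy_penguins <- data.frame(
  species = c("Adelie Penguin", "Gentoo penguin"),
  culmen_length_mm = c(39.1, 46.1),
  flipper_length_mm = c(181, 211),
  body_mass_g = c(3750, 4500)
)

# Columns whose names end in "_mm"
by_suffix <- select(toy_penguins, ends_with("_mm"))
names(by_suffix)

# Columns whose names contain the literal string "length"
by_string <- select(toy_penguins, contains("length"))
names(by_string)

# Columns whose names match a regular expression: ending in "_mm" or "_g"
by_regex <- select(toy_penguins, matches("_(mm|g)$"))
names(by_regex)
```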
Even more ways to select
Select the first 5 variables:
select(penguins_clean, 1:5)
## # A tibble: 344 x 5
## study_name sample_number species latin_name island
## <chr> <dbl> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin Pygoscelis adeliae Torgersen
## 2 PAL0708 2 Adelie Penguin Pygoscelis adeliae Torgersen
## 3 PAL0708 3 Adelie Penguin Pygoscelis adeliae Torgersen
## 4 PAL0708 4 Adelie Penguin Pygoscelis adeliae Torgersen
## 5 PAL0708 5 Adelie Penguin Pygoscelis adeliae Torgersen
## 6 PAL0708 6 Adelie Penguin Pygoscelis adeliae Torgersen
## 7 PAL0708 7 Adelie Penguin Pygoscelis adeliae Torgersen
## 8 PAL0708 8 Adelie Penguin Pygoscelis adeliae Torgersen
## 9 PAL0708 9 Adelie Penguin Pygoscelis adeliae Torgersen
## 10 PAL0708 10 Adelie Penguin Pygoscelis adeliae Torgersen
## # ... with 334 more rows
Select everything from individual_id to
flipper_length_mm.
select(penguins_clean, individual_id:flipper_length_mm)
## # A tibble: 344 x 6
## individual_id clutch_completion date_egg culmen_length_mm culmen_depth_mm
## <chr> <chr> <date> <dbl> <dbl>
## 1 N1A1 Yes 2007-11-11 39.1 18.7
## 2 N1A2 Yes 2007-11-11 39.5 17.4
## 3 N2A1 Yes 2007-11-16 40.3 18
## 4 N2A2 Yes 2007-11-16 NA NA
## 5 N3A1 Yes 2007-11-16 36.7 19.3
## 6 N3A2 Yes 2007-11-16 39.3 20.6
## 7 N4A1 No 2007-11-15 38.9 17.8
## 8 N4A2 No 2007-11-15 39.2 19.6
## 9 N5A1 Yes 2007-11-09 34.1 18.1
## 10 N5A2 Yes 2007-11-09 42 20.2
## # ... with 334 more rows, and 1 more variable: flipper_length_mm <dbl>
filter()
helps you filter rows
Here we only keep penguins from Dream Island.
filter(penguins_clean, island == "Dream")
## # A tibble: 124 x 16
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0708 31 Adelie Penguin Pygoscelis adel~ Dream N21A1
## 2 PAL0708 32 Adelie Penguin Pygoscelis adel~ Dream N21A2
## 3 PAL0708 33 Adelie Penguin Pygoscelis adel~ Dream N22A1
## 4 PAL0708 34 Adelie Penguin Pygoscelis adel~ Dream N22A2
## 5 PAL0708 35 Adelie Penguin Pygoscelis adel~ Dream N23A1
## 6 PAL0708 36 Adelie Penguin Pygoscelis adel~ Dream N23A2
## 7 PAL0708 37 Adelie Penguin Pygoscelis adel~ Dream N24A1
## 8 PAL0708 38 Adelie Penguin Pygoscelis adel~ Dream N24A2
## 9 PAL0708 39 Adelie Penguin Pygoscelis adel~ Dream N25A1
## 10 PAL0708 40 Adelie Penguin Pygoscelis adel~ Dream N25A2
## # ... with 114 more rows, and 10 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
%in%
Here the %in% operator can come in
handy again if we want to filter more than one island:
islands_to_keep <- c("Dream", "Biscoe")
filter(penguins_clean, island %in% islands_to_keep)
## # A tibble: 292 x 16
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0708 21 Adelie Penguin Pygoscelis adel~ Biscoe N11A1
## 2 PAL0708 22 Adelie Penguin Pygoscelis adel~ Biscoe N11A2
## 3 PAL0708 23 Adelie Penguin Pygoscelis adel~ Biscoe N12A1
## 4 PAL0708 24 Adelie Penguin Pygoscelis adel~ Biscoe N12A2
## 5 PAL0708 25 Adelie Penguin Pygoscelis adel~ Biscoe N13A1
## 6 PAL0708 26 Adelie Penguin Pygoscelis adel~ Biscoe N13A2
## 7 PAL0708 27 Adelie Penguin Pygoscelis adel~ Biscoe N17A1
## 8 PAL0708 28 Adelie Penguin Pygoscelis adel~ Biscoe N17A2
## 9 PAL0708 29 Adelie Penguin Pygoscelis adel~ Biscoe N18A1
## 10 PAL0708 30 Adelie Penguin Pygoscelis adel~ Biscoe N18A2
## # ... with 282 more rows, and 10 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
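To see what %in% does inside filter(), it can help to look at the logical vector it produces. The following is a minimal sketch in base R; the vector islands is a made-up example and not part of penguins_clean.

```r
# Hypothetical example vector; not taken from penguins_clean
islands <- c("Dream", "Biscoe", "Torgersen", "Dream")
islands_to_keep <- c("Dream", "Biscoe")

# %in% checks every element of islands against the whole set
keep <- islands %in% islands_to_keep
keep
# → TRUE TRUE FALSE TRUE

# == compares element by element and recycles the shorter vector,
# so the last "Dream" is compared with "Biscoe" and is missed
wrong <- islands == islands_to_keep
wrong
# → TRUE TRUE FALSE FALSE

# Subsetting with the logical vector keeps the matching elements,
# which is what filter() does with the rows of a data frame
islands[keep]
```

This is why %in% rather than == is used whenever the right-hand side holds more than one value.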
mutate()
helps you create variables
mutate will take a statement like this:
variable_name = some_calculation
and attach variable_name at the end of the
dataset.
Let’s say we want to express penguin body mass in kilograms rather than grams.
We take the variable body_mass_g and divide it by 1000.
pg_new <- mutate(penguins_clean, bodymass_kg = body_mass_g/1000)
We temporarily assign the dataset to pg_new just to check whether it worked correctly:
select(pg_new, bodymass_kg, body_mass_g)
## # A tibble: 344 x 2
## bodymass_kg body_mass_g
## <dbl> <dbl>
## 1 3.75 3750
## 2 3.8 3800
## 3 3.25 3250
## 4 NA NA
## 5 3.45 3450
## 6 3.65 3650
## 7 3.62 3625
## 8 4.68 4675
## 9 3.48 3475
## 10 4.25 4250
## # ... with 334 more rows
Recoding with ifelse
ifelse() is a very useful function that lets you easily
recode variables based on logical tests.
Its basic functionality looks like this:
\[\color{red}{\text{ifelse}}(\color{orange}{\text{logical test}},\color{blue}{\text{what should happen if TRUE}}, \color{green}{\text{what should happen if FALSE}})\]
Here is a very basic example:
ifelse(1 == 1, "Pick me if test is TRUE", "Pick me if test is FALSE")
## [1] "Pick me if test is TRUE"
ifelse(1 != 1, "Pick me if test is TRUE", "Pick me if test is FALSE")
## [1] "Pick me if test is FALSE"
Let’s use ifelse in combination with
mutate.
Let’s create the variable sex_short which has a shorter
label for sex:
pg_new <- mutate(penguins_clean, sex_short = ifelse(sex == "MALE", "m", "f"))
We temporarily assign the dataset to pg_new just to check whether it worked correctly:
select(pg_new, sex, sex_short)
## # A tibble: 344 x 2
## sex sex_short
## <chr> <chr>
## 1 MALE m
## 2 FEMALE f
## 3 FEMALE f
## 4 <NA> <NA>
## 5 FEMALE f
## 6 MALE m
## 7 FEMALE f
## 8 MALE m
## 9 <NA> <NA>
## 10 <NA> <NA>
## # ... with 334 more rows
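Two details of the recoding above are easy to overlook: a missing value in the test stays missing, and every value that is not "MALE" ends up as "f". The following base R sketch illustrates both on a made-up vector (the vector and the label "unknown" are hypothetical).

```r
# Hypothetical vector with a missing value and an unexpected label
sex <- c("MALE", "FEMALE", NA, "unknown")

# NA in the test gives NA in the result;
# anything that is not "MALE" is recoded to "f"
simple <- ifelse(sex == "MALE", "m", "f")
simple
# → "m" "f" NA "f"

# Nesting a second ifelse() makes the "f" category explicit
# and leaves all other values as NA
strict <- ifelse(sex == "MALE", "m",
                 ifelse(sex == "FEMALE", "f", NA))
strict
# → "m" "f" NA NA
```

With only two labels in the data, as in penguins_clean, both versions give the same result.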
Recoding with case_when
case_when (from the dplyr package) is like
ifelse but allows for much more complex combinations.
The basic setup for a case_when call looks like
this:
case_when(
\(\color{orange}{\text{logical test}}\) ~ \(\color{blue}{\text{what should happen if TRUE}}\),
\(\color{orange}{\text{logical test}}\) ~ \(\color{blue}{\text{what should happen if TRUE}}\),
\(\color{orange}{\text{logical test}}\) ~ \(\color{blue}{\text{what should happen if TRUE}}\),
\(TRUE\) ~ \(\color{green}{\text{what should happen with everything else}}\),
)
The following code recodes a numeric vector (1 through 50) into three categories:
x <- c(1:50)
x
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
case_when(
x %in% 1:10 ~ "1 through 10",
x %in% 11:30 ~ "11 through 30",
TRUE ~ "above 30"
)
## [1] "1 through 10" "1 through 10" "1 through 10" "1 through 10"
## [5] "1 through 10" "1 through 10" "1 through 10" "1 through 10"
## [9] "1 through 10" "1 through 10" "11 through 30" "11 through 30"
## [13] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
## [17] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
## [21] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
## [25] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
## [29] "11 through 30" "11 through 30" "above 30" "above 30"
## [33] "above 30" "above 30" "above 30" "above 30"
## [37] "above 30" "above 30" "above 30" "above 30"
## [41] "above 30" "above 30" "above 30" "above 30"
## [45] "above 30" "above 30" "above 30" "above 30"
## [49] "above 30" "above 30"
Let’s use case_when in combination with
mutate.
We create the variable island_short, which has a shorter
label for island:
test <- mutate(penguins_clean,
island_short = case_when(
island == "Torgersen" ~ "T",
island == "Biscoe" ~ "B",
island == "Dream" ~ "D"
))
select(test, island, island_short)
## # A tibble: 344 x 2
## island island_short
## <chr> <chr>
## 1 Torgersen T
## 2 Torgersen T
## 3 Torgersen T
## 4 Torgersen T
## 5 Torgersen T
## 6 Torgersen T
## 7 Torgersen T
## 8 Torgersen T
## 9 Torgersen T
## 10 Torgersen T
## # ... with 334 more rows
With case_when you can also combine conditions on different variables,
which makes it a very powerful tool!
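As a minimal sketch of mixing variables, the example below combines island and body_mass_g in one case_when call. The data frame toy, its values, the 4000 g cut-off and the labels are all made up for illustration. Conditions are checked from top to bottom and the first one that is TRUE wins.

```r
library(dplyr)

# Hypothetical mini dataset; the values are invented for illustration
toy <- tibble(
  island = c("Dream", "Dream", "Biscoe", "Torgersen"),
  body_mass_g = c(3250, 4650, 5400, NA)
)

toy <- mutate(toy,
  size_island = case_when(
    # handle missing values first so later tests never see them
    is.na(body_mass_g)                      ~ "unknown",
    island == "Dream" & body_mass_g >= 4000 ~ "heavy Dream penguin",
    island == "Dream"                       ~ "light Dream penguin",
    # everything that matched none of the tests above
    TRUE                                    ~ "other island"
  ))
toy$size_island
# → "light Dream penguin" "heavy Dream penguin" "other island" "unknown"
```

Without the final TRUE line, rows that match none of the tests would get NA.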
rename()
rename() changes only the variable name, using the pattern new_name = old_name, and leaves everything else intact:
rename(penguins_clean, sample = sample_number)
## # A tibble: 344 x 16
## study_name sample species latin_name island individual_id clutch_completi~
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie Pe~ Pygosceli~ Torge~ N1A1 Yes
## 2 PAL0708 2 Adelie Pe~ Pygosceli~ Torge~ N1A2 Yes
## 3 PAL0708 3 Adelie Pe~ Pygosceli~ Torge~ N2A1 Yes
## 4 PAL0708 4 Adelie Pe~ Pygosceli~ Torge~ N2A2 Yes
## 5 PAL0708 5 Adelie Pe~ Pygosceli~ Torge~ N3A1 Yes
## 6 PAL0708 6 Adelie Pe~ Pygosceli~ Torge~ N3A2 Yes
## 7 PAL0708 7 Adelie Pe~ Pygosceli~ Torge~ N4A1 No
## 8 PAL0708 8 Adelie Pe~ Pygosceli~ Torge~ N4A2 No
## 9 PAL0708 9 Adelie Pe~ Pygosceli~ Torge~ N5A1 Yes
## 10 PAL0708 10 Adelie Pe~ Pygosceli~ Torge~ N5A2 Yes
## # ... with 334 more rows, and 9 more variables: date_egg <date>,
## # culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
## # body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
## # comments <chr>
arrange()
You can order your data to show the highest or lowest value first.
Let’s order by flipper_length_mm.
Lowest first:
arrange(penguins_clean, flipper_length_mm)
## # A tibble: 344 x 16
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0708 29 Adelie Penguin Pygoscelis a~ Biscoe N18A1
## 2 PAL0708 21 Adelie Penguin Pygoscelis a~ Biscoe N11A1
## 3 PAL0910 123 Adelie Penguin Pygoscelis a~ Torge~ N67A1
## 4 PAL0708 31 Adelie Penguin Pygoscelis a~ Dream N21A1
## 5 PAL0708 32 Adelie Penguin Pygoscelis a~ Dream N21A2
## 6 PAL0809 99 Adelie Penguin Pygoscelis a~ Dream N50A1
## 7 PAL0708 7 Chinstrap penguin Pygoscelis a~ Dream N66A1
## 8 PAL0708 48 Adelie Penguin Pygoscelis a~ Dream N29A2
## 9 PAL0708 12 Adelie Penguin Pygoscelis a~ Torge~ N6A2
## 10 PAL0708 22 Adelie Penguin Pygoscelis a~ Biscoe N11A2
## # ... with 334 more rows, and 10 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
Highest first using desc() (for descending):
arrange(penguins_clean, desc(flipper_length_mm))
## # A tibble: 344 x 16
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0809 64 Gentoo penguin Pygoscelis papua Biscoe N19A2
## 2 PAL0708 2 Gentoo penguin Pygoscelis papua Biscoe N31A2
## 3 PAL0708 34 Gentoo penguin Pygoscelis papua Biscoe N56A2
## 4 PAL0809 66 Gentoo penguin Pygoscelis papua Biscoe N20A2
## 5 PAL0809 76 Gentoo penguin Pygoscelis papua Biscoe N56A2
## 6 PAL0910 90 Gentoo penguin Pygoscelis papua Biscoe N14A2
## 7 PAL0910 114 Gentoo penguin Pygoscelis papua Biscoe N34A2
## 8 PAL0910 116 Gentoo penguin Pygoscelis papua Biscoe N35A2
## 9 PAL0809 68 Gentoo penguin Pygoscelis papua Biscoe N51A2
## 10 PAL0910 112 Gentoo penguin Pygoscelis papua Biscoe N32A2
## # ... with 334 more rows, and 10 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
group_by() and summarize()
When you want to aggregate your data (by groups)
Sometimes we want to calculate group statistics.
In other languages this is often a pain.
With dplyr this is fairly easy and
readable.
Let’s calculate the average culmen_length_mm for each
sex.
First we group penguins_clean by
sex.
grouped_by_sex <- group_by(penguins_clean, sex)
summarize (also spelled summarise) works in a similar way to mutate:
variable_name = some_calculation
summarise(grouped_by_sex, avg_culmen_length = mean(culmen_length_mm, na.rm = T))
## # A tibble: 3 x 2
## sex avg_culmen_length
## <chr> <dbl>
## 1 FEMALE 42.1
## 2 MALE 45.9
## 3 <NA> 41.3
We could also keep the data structure by using mutate on
a grouped dataset:
mutate(grouped_by_sex, avg_culmen_length = mean(culmen_length_mm, na.rm = T))
## # A tibble: 344 x 17
## # Groups: sex [3]
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin Pygoscelis adel~ Torge~ N1A1
## 2 PAL0708 2 Adelie Penguin Pygoscelis adel~ Torge~ N1A2
## 3 PAL0708 3 Adelie Penguin Pygoscelis adel~ Torge~ N2A1
## 4 PAL0708 4 Adelie Penguin Pygoscelis adel~ Torge~ N2A2
## 5 PAL0708 5 Adelie Penguin Pygoscelis adel~ Torge~ N3A1
## 6 PAL0708 6 Adelie Penguin Pygoscelis adel~ Torge~ N3A2
## 7 PAL0708 7 Adelie Penguin Pygoscelis adel~ Torge~ N4A1
## 8 PAL0708 8 Adelie Penguin Pygoscelis adel~ Torge~ N4A2
## 9 PAL0708 9 Adelie Penguin Pygoscelis adel~ Torge~ N5A1
## 10 PAL0708 10 Adelie Penguin Pygoscelis adel~ Torge~ N5A2
## # ... with 334 more rows, and 11 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>,
## # avg_culmen_length <dbl>
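Because the new column is cut off in the printed output above, the difference between summarise and mutate on grouped data may be easier to see on a tiny example. This is a sketch on made-up data; toy and its values are hypothetical.

```r
library(dplyr)

# Hypothetical mini dataset; the values are invented for illustration
toy <- tibble(
  sex = c("MALE", "MALE", "FEMALE", "FEMALE"),
  culmen_length_mm = c(40, 50, 36, NA)
)
grouped <- group_by(toy, sex)

# summarise() collapses the data to one row per group
per_group <- summarise(grouped, avg = mean(culmen_length_mm, na.rm = TRUE))
per_group$avg
# → 36 45   (FEMALE, MALE)

# mutate() keeps all four rows and repeats the group mean on each row
with_avg <- mutate(grouped, avg = mean(culmen_length_mm, na.rm = TRUE))
with_avg <- ungroup(with_avg)
with_avg$avg
# → 45 45 36 36
```

The grouped mutate is handy when each row should be compared with its own group average.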
Once we are done with group_by we should
ungroup our data again.
next_data <- ungroup(grouped_by_sex)
count()
Now this is a function that I use all the time.
This function counts how often each value occurs within one or more variables.
Simply specify which variable you want to count.
Let’s count how often each species occurs.
count(penguins_clean, species, sort = T)
## # A tibble: 3 x 2
## species n
## <chr> <int>
## 1 Adelie Penguin 152
## 2 Gentoo penguin 124
## 3 Chinstrap penguin 68
The sort = T argument tells the function to sort the result by
frequency, with the most frequent value first.
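One way to think about count() is as a shortcut for grouping, summarising and sorting. The sketch below compares the two on a made-up data frame (toy and its values are hypothetical); n() is the dplyr helper that returns the number of rows in the current group.

```r
library(dplyr)

# Hypothetical mini dataset; the values are invented for illustration
toy <- tibble(
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo")
)

# The long way: group, count the rows per group, then sort
grouped <- group_by(toy, species)
counted <- summarise(grouped, n = n())
by_hand <- arrange(counted, desc(n))

# The shortcut
shortcut <- count(toy, species, sort = TRUE)
shortcut$species
# → "Adelie" "Gentoo" "Chinstrap"
shortcut$n
# → 3 2 1
```

Both approaches should return the same species and counts here.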
The %>% operator
The point of the pipe is to help you write code in a way that is easier to read and understand.
Let’s consider an example with some data manipulation we have done so far:
## first I select variables
pg <- select(penguins_clean, individual_id, island, body_mass_g)
## then I filter to only Dream island
pg <- filter(pg, island == "Dream")
## then I convert body_mass_g to kg
pg <- mutate(pg, bodymass_kg = body_mass_g/1000)
## rename individual id to simply id
pg <- rename(pg, id = individual_id)
Now this works, but the problem is that we have to write a lot of code that repeats itself!
pg
## # A tibble: 124 x 4
## id island body_mass_g bodymass_kg
## <chr> <chr> <dbl> <dbl>
## 1 N21A1 Dream 3250 3.25
## 2 N21A2 Dream 3900 3.9
## 3 N22A1 Dream 3300 3.3
## 4 N22A2 Dream 3900 3.9
## 5 N23A1 Dream 3325 3.32
## 6 N23A2 Dream 4150 4.15
## 7 N24A1 Dream 3950 3.95
## 8 N24A2 Dream 3550 3.55
## 9 N25A1 Dream 3300 3.3
## 10 N25A2 Dream 4650 4.65
## # ... with 114 more rows
Another alternative is to nest all the functions:
rename(mutate(filter(select(penguins_clean, individual_id, island, body_mass_g), island == "Dream"), bodymass_kg = body_mass_g/1000), id = individual_id)
## # A tibble: 124 x 4
## id island body_mass_g bodymass_kg
## <chr> <chr> <dbl> <dbl>
## 1 N21A1 Dream 3250 3.25
## 2 N21A2 Dream 3900 3.9
## 3 N22A1 Dream 3300 3.3
## 4 N22A2 Dream 3900 3.9
## 5 N23A1 Dream 3325 3.32
## 6 N23A2 Dream 4150 4.15
## 7 N24A1 Dream 3950 3.95
## 8 N24A2 Dream 3550 3.55
## 9 N25A1 Dream 3300 3.3
## 10 N25A2 Dream 4650 4.65
## # ... with 114 more rows
But that’s extremely tough to read and understand!
The piping style:
Read the code from top to bottom and from left to right, and read the
%>% as “and then”.
Data first, data once
penguins_clean %>%
select(individual_id, island, body_mass_g) %>%
filter(island == "Dream") %>%
mutate(bodymass_kg = body_mass_g/1000) %>%
rename(id = individual_id)
## # A tibble: 124 x 4
## id island body_mass_g bodymass_kg
## <chr> <chr> <dbl> <dbl>
## 1 N21A1 Dream 3250 3.25
## 2 N21A2 Dream 3900 3.9
## 3 N22A1 Dream 3300 3.3
## 4 N22A2 Dream 3900 3.9
## 5 N23A1 Dream 3325 3.32
## 6 N23A2 Dream 4150 4.15
## 7 N24A1 Dream 3950 3.95
## 8 N24A2 Dream 3550 3.55
## 9 N25A1 Dream 3300 3.3
## 10 N25A2 Dream 4650 4.65
## # ... with 114 more rows
group_by() again
Grouping also becomes easier using pipes.
Let’s try again to calculate the average
culmen_length_mm for each sex but this time with pipes.
penguins_clean %>%
group_by(sex) %>%
summarise(avg_culmen_length = mean(culmen_length_mm , na.rm = T)) %>%
ungroup()
## # A tibble: 3 x 2
## sex avg_culmen_length
## <chr> <dbl>
## 1 FEMALE 42.1
## 2 MALE 45.9
## 3 <NA> 41.3
[Image: tidyverse style syntax meme]
Small Note on the Pipe
Since R version 4.1.0, base R also provides a pipe.
It looks like this: |>
While it shares many similarities with the %>% there
are also some differences.
It’s beyond the scope of this workshop to go over it here so for the
sake of simplicity we will stick with the magrittr
pipe.
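For the curious, here is a minimal sketch of the base R pipe in its simplest use, which needs no extra package but does need R 4.1.0 or newer. The vector lifespans is a hypothetical name; its values are a few of the animal lifespans from the start of this part.

```r
# Requires R >= 4.1.0; base R only
lifespans <- c(392, 24, 177, 4, 0.3)

# Piped: sort from highest to lowest, and then keep the first three
top3 <- lifespans |>
  sort(decreasing = TRUE) |>
  head(3)
top3
# → 392 177 24

# The same steps written as nested calls
nested <- head(sort(lifespans, decreasing = TRUE), 3)
```

In a simple chain like this one, |> can be read as "and then", just like %>%.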